Here we take a quick first look at the data from the provided dataset.
library(scater, quietly = TRUE)
library(knitr)
## options
opts_chunk$set(fig.align = 'center', fig.width = 10, fig.height = 5, dev = 'png',
warning = FALSE, message = FALSE)
library(ggplot2)
library(viridis)
library(ggthemes)
library(cowplot)
library(magrittr)
library(destiny)
## load data
load(params$rdata_file)To help with the QC of the dataset, scater enables straight-forward calculation of many QC metrics with the calculateQCMetrics function. To help produce useful metrics, it is good to identify gene and cell controls, if present. In this case, we use ERCC spike-ins and mitochondrial genes as gene controls (called “feature_controls” in `scater).
But first, filter out genes with no expression at all - time-wasters.
nonzero_genes <- (rowSums(counts(sce_gene)) > 0)
sum(nonzero_genes)
sce_gene_filt <- sce_gene[nonzero_genes,]This leaves us with 36376 genes and 707 cells.
However, I will set a low threshold to call an observation “expressed” - any count above zero suffices. Below is a summary of the percentage of counts coming from ERCC spike-in controls and mitochondrial (MT) genes.
count_thresh <- 0
is_exprs(sce_gene_filt) <- (counts(sce_gene_filt) > count_thresh)
ercc_genes <- grepl("^ERCC", featureNames(sce_gene_filt))
mt_genes <- grepl("^MT-", fData(sce_gene_filt)$hgnc_symbol)
sce_gene_filt <- calculateQCMetrics(sce_gene_filt,
feature_controls = list(ERCC = ercc_genes, MT = mt_genes))stats <- summary(sce_gene_filt$pct_counts_feature_controls_ERCC)
kable(data.frame("Stat" = names(stats), "Pct Counts from ERCC" = as.vector(stats)))| Stat | Pct.Counts.from.ERCC |
|---|---|
| Min. | 0.00000 |
| 1st Qu. | 0.06476 |
| Median | 0.23270 |
| Mean | 2.02900 |
| 3rd Qu. | 0.43440 |
| Max. | 92.96000 |
stats_MT <- summary(sce_gene_filt$pct_counts_feature_controls_MT)
kable(data.frame("Stat" = names(stats_MT), "Pct Counts from MT" = as.vector(stats_MT)))| Stat | Pct.Counts.from.MT |
|---|---|
| Min. | 0.000 |
| 1st Qu. | 2.777 |
| Median | 4.218 |
| Mean | 8.221 |
| 3rd Qu. | 5.740 |
| Max. | 80.410 |
Plot number of expressed features against total counts, with cells coloured by different QC metrics:
p1 <- plotPhenoData(sce_gene_filt,
aes(x = log10_total_counts, y = total_features,
colour = filter_on_total_features)) +
theme(legend.position = "bottom") +
geom_vline(xintercept = 5, linetype = 2)
p2 <- plotPhenoData(sce_gene_filt,
aes(x = log10_total_counts, y = total_features,
colour = pct_counts_top_100_features)) +
theme(legend.position = "bottom") +
geom_vline(xintercept = 5, linetype = 2)
p3 <- plotPhenoData(sce_gene_filt,
aes(x = log10_total_counts, y = total_features,
colour = pct_counts_feature_controls_ERCC)) +
theme(legend.position = "bottom") +
geom_vline(xintercept = 5, linetype = 2)
plot_grid(p1, p2, p3, labels = c("A", "B", "C"),
nrow = 1)Look at percentage of genes not expressed (percentage dropout) for each cell plotted against the percentage of counts obtained from ERCC spike-in controls.
The cumulative expression plot (with cells coloured by percentage of counts coming from feature controls, i.e. ERCC and MT genes) shows a handful of low complexity libraries and otherwise a range of library complexities associated with the percentage of expression accounted for by feature controls..
p1 <- plot(sce_gene_filt, exprs_values = "counts",
colour_by = "pct_exprs_feature_controls_ERCC") +
theme(legend.position = "bottom")
p2 <- plot(sce_gene_filt, exprs_values = "counts",
colour_by = "pct_exprs_feature_controls_MT") +
theme(legend.position = "bottom")
plot_grid(p1, p2, labels = c("A", "B"), nrow = 1)Plot the most expressed genes across each dataset. Surprisingly, the ERCC spike-ins do not feature in the list of most-expressed genes. Many MT and ribosomal genes appear, along with GAPDH, as expected. The most expressed gene is the Y chromosome gene AC010970.2.
plotQC(sce_gene_filt)Plot expression frequency against mean expression. This highlights generally decreasing dropout with increasing mean expression.
Another option available in scater is to conduct PCA on a set of QC metrics. The advantage of doing this is that the QC metrics focus on technical aspects of the libraries that are likely to distinguish problematics cells. Automatic outlier detection on PCA plots using QC metrics is available to help identify potentially problematic cells.
We use the following metrics for PCA-based outlier detection:
pct_counts_top_100_featurestotal_featurespct_counts_feature_controls_MTpct_counts_feature_controls_ERCCn_detected_feature_controlslog10_counts_endogenous_featureslog10_counts_feature_controlsA particular set of variables to be used can be specified with the selected_variables argument as shown in the example below.
plotPCA(sce_gene_filt, size_by = "total_features",
shape_by = "filter_on_total_features",
pca_data_input = "pdata", detect_outliers = TRUE)## The following cells/samples are detected as outliers:
## quant_kallisto/19776_1#1
## quant_kallisto/19776_1#10
## quant_kallisto/19776_1#101
## quant_kallisto/19776_1#107
## quant_kallisto/19776_1#108
## quant_kallisto/19776_1#112
## quant_kallisto/19776_1#114
## quant_kallisto/19776_1#12
## quant_kallisto/19776_1#121
## quant_kallisto/19776_1#124
## quant_kallisto/19776_1#131
## quant_kallisto/19776_1#133
## quant_kallisto/19776_1#144
## quant_kallisto/19776_1#147
## quant_kallisto/19776_1#149
## quant_kallisto/19776_1#153
## quant_kallisto/19776_1#155
## quant_kallisto/19776_1#159
## quant_kallisto/19776_1#16
## quant_kallisto/19776_1#161
## quant_kallisto/19776_1#162
## quant_kallisto/19776_1#169
## quant_kallisto/19776_1#171
## quant_kallisto/19776_1#172
## quant_kallisto/19776_1#174
## quant_kallisto/19776_1#175
## quant_kallisto/19776_1#177
## quant_kallisto/19776_1#181
## quant_kallisto/19776_1#187
## quant_kallisto/19776_1#189
## quant_kallisto/19776_1#197
## quant_kallisto/19776_1#2
## quant_kallisto/19776_1#200
## quant_kallisto/19776_1#206
## quant_kallisto/19776_1#209
## quant_kallisto/19776_1#211
## quant_kallisto/19776_1#217
## quant_kallisto/19776_1#218
## quant_kallisto/19776_1#220
## quant_kallisto/19776_1#222
## quant_kallisto/19776_1#225
## quant_kallisto/19776_1#231
## quant_kallisto/19776_1#239
## quant_kallisto/19776_1#240
## quant_kallisto/19776_1#243
## quant_kallisto/19776_1#250
## quant_kallisto/19776_1#254
## quant_kallisto/19776_1#26
## quant_kallisto/19776_1#262
## quant_kallisto/19776_1#267
## quant_kallisto/19776_1#270
## quant_kallisto/19776_1#271
## quant_kallisto/19776_1#274
## quant_kallisto/19776_1#279
## quant_kallisto/19776_1#282
## quant_kallisto/19776_1#283
## quant_kallisto/19776_1#285
## quant_kallisto/19776_1#287
## quant_kallisto/19776_1#291
## quant_kallisto/19776_1#292
## quant_kallisto/19776_1#295
## quant_kallisto/19776_1#3
## quant_kallisto/19776_1#300
## quant_kallisto/19776_1#309
## quant_kallisto/19776_1#318
## quant_kallisto/19776_1#322
## quant_kallisto/19776_1#325
## quant_kallisto/19776_1#33
## quant_kallisto/19776_1#333
## quant_kallisto/19776_1#340
## quant_kallisto/19776_1#345
## quant_kallisto/19776_1#358
## quant_kallisto/19776_1#361
## quant_kallisto/19776_1#362
## quant_kallisto/19776_1#38
## quant_kallisto/19776_1#43
## quant_kallisto/19776_1#5
## quant_kallisto/19776_1#54
## quant_kallisto/19776_1#56
## quant_kallisto/19776_1#57
## quant_kallisto/19776_1#60
## quant_kallisto/19776_1#61
## quant_kallisto/19776_1#64
## quant_kallisto/19776_1#65
## quant_kallisto/19776_1#76
## quant_kallisto/19776_1#77
## quant_kallisto/19776_1#79
## quant_kallisto/19776_1#8
## quant_kallisto/19776_1#86
## quant_kallisto/19776_1#87
## quant_kallisto/19776_1#9
## quant_kallisto/19776_1#98
## quant_kallisto/19776_1#99
## quant_kallisto/19776_2#114
## quant_kallisto/19776_2#133
## quant_kallisto/19776_2#138
## quant_kallisto/19776_2#144
## quant_kallisto/19776_2#164
## quant_kallisto/19776_2#17
## quant_kallisto/19776_2#172
## quant_kallisto/19776_2#173
## quant_kallisto/19776_2#198
## quant_kallisto/19776_2#216
## quant_kallisto/19776_2#238
## quant_kallisto/19776_2#275
## quant_kallisto/19776_2#42
## Variables with highest loadings for PC1 and PC2:
##
## PC1 PC2
## --------------------------------- ----------- -----------
## pct_counts_top_100_features 0.4287933 -0.2472655
## pct_counts_feature_controls 0.0782467 -0.7800732
## n_detected_feature_controls -0.4194998 -0.3829379
## log10_counts_feature_controls -0.4489378 -0.3407878
## total_features -0.4568373 0.2510491
## log10_counts_endogenous_features -0.4730626 0.0673957
Produce a standard PCA too.
plotPCA(sce_gene_filt, size_by = "total_features",
colour_by = "pct_counts_feature_controls_ERCC")Look at which PCs are most correlated with certain QC metrics, such as…
Total features
plotQC(sce_gene_filt, type = "find-pcs",
variable = "total_features") % counts from top 50 features
plotQC(sce_gene_filt, type = "find-pcs",
variable = "pct_counts_top_50_features") sce_gene_filt$start_time <- NULL
zero_var <- matrixStats::rowVars(exprs(sce_gene_filt)) == 0
plotQC(sce_gene_filt[!zero_var,], "expl",
variables = c("pct_dropout", "total_features",
"pct_counts_top_200_features",
"pct_counts_feature_controls_ERCC",
"pct_counts_feature_controls_MT",
"n_detected_feature_controls",
"log10_counts_endogenous_features",
"log10_counts_feature_controls_ERCC",
"log10_counts_feature_controls_MT"))